Goto

Collaborating Authors

 correlation dimension



Correlation Dimension of Auto-Regressive Large Language Models

Du, Xin, Tanaka-Ishii, Kumiko

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.



Tailored minimal reservoir computing: on the bidirectional connection between nonlinearities in the reservoir and in data

Prosperino, Davide, Ma, Haochun, Räth, Christoph

arXiv.org Artificial Intelligence

We study how the degree of nonlinearity in the input data affects the optimal design of reservoir computers, focusing on how closely the model's nonlinearity should align with that of the data. By reducing minimal RCs to a single tunable nonlinearity parameter, we explore how the predictive performance varies with the degree of nonlinearity in the reservoir. To provide controlled testbeds, we generalize to the fractional Halvorsen system, a novel chaotic system with fractional exponents. Our experiments reveal that the prediction performance is maximized when the reservoir's nonlinearity matches the nonlinearity present in the data. In cases where multiple nonlinearities are present in the data, we find that the correlation dimension of the predicted signal is reconstructed correctly when the smallest nonlinearity is matched. We use this observation to propose a method for estimating the minimal nonlinearity in unknown time series by sweeping the reservoir exponent and identifying the transition to a successful reconstruction. Applying this method to both synthetic and real-world datasets, including financial time series, we demonstrate its practical viability. Finally, we transfer these insights to classical RC by augmenting traditional architectures with fractional, generalized reservoir states. This yields performance gains, particularly in resource-constrained scenarios such as physical reservoirs, where increasing reservoir size is impractical or economically unviable. Our work provides a principled route toward tailoring RCs to the intrinsic complexity of the systems they aim to model.


On consistent estimation of dimension values

Cholaquidis, Alejandro, Cuevas, Antonio, Pateiro-López, Beatriz

arXiv.org Machine Learning

The problem of estimating, from a random sample of points, the dimension of a compact subset S of the Euclidean space is considered. The emphasis is put on consistency results in the statistical sense. That is, statements of convergence to the true dimension value when the sample size grows to infinity. Among the many available definitions of dimension, we have focused (on the grounds of its statistical tractability) on three notions: the Minkowski dimension, the correlation dimension and the, perhaps less popular, concept of pointwise dimension. We prove the statistical consistency of some natural estimators of these quantities. Our proofs partially rely on the use of an instrumental estimator formulated in terms of the empirical volume function Vn (r), defined as the Lebesgue measure of the set of points whose distance to the sample is at most r. In particular, we explore the case in which the true volume function V (r) of the target set S is a polynomial on some interval starting at zero. An empirical study is also included. Our study aims to provide some theoretical support, and some practical insights, for the problem of deciding whether or not the set S has a dimension smaller than that of the ambient space. This is a major statistical motivation of the dimension studies, in connection with the so-called Manifold Hypothesis.


Controlling dynamical systems into unseen target states using machine learning

Köglmayr, Daniel, Haluszczynski, Alexander, Räth, Christoph

arXiv.org Artificial Intelligence

We present a novel, model-free, and data-driven methodology for controlling complex dynamical systems into previously unseen target states, including those with significantly different and complex dynamics. Leveraging a parameter-aware realization of next-generation reservoir computing, our approach accurately predicts system behavior in unobserved parameter regimes, enabling control over transitions to arbitrary target states. Crucially, this includes states with dynamics that differ fundamentally from known regimes, such as shifts from periodic to intermittent or chaotic behavior. The method's parameter-awareness facilitates non-stationary control, ensuring smooth transitions between states. By extending the applicability of machine learning-based control mechanisms to previously inaccessible target dynamics, this methodology opens the door to transformative new applications while maintaining exceptional efficiency. Our results highlight reservoir computing as a powerful alternative to traditional methods for dynamic system control.


Correlation Dimension of Natural Language in a Statistical Manifold

Du, Xin, Tanaka-Ishii, Kumiko

arXiv.org Artificial Intelligence

The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barab\'asi-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.


What is the $\textit{intrinsic}$ dimension of your binary data? -- and how to compute it quickly

Hanika, Tom, Hille, Tobias

arXiv.org Artificial Intelligence

Dimensionality is an important aspect for analyzing and understanding (high-dimensional) data. In their 2006 ICDM paper Tatti et al. answered the question for a (interpretable) dimension of binary data tables by introducing a normalized correlation dimension. In the present work we revisit their results and contrast them with a concept based notion of intrinsic dimension (ID) recently introduced for geometric data sets. To do this, we present a novel approximation for this ID that is based on computing concepts only up to a certain support value. We demonstrate and evaluate our approximation using all available datasets from Tatti et al., which have between 469 and 41271 extrinsic dimensions.


Scaling Laws in Jet Classification

Batson, Joshua, Kahn, Yonatan

arXiv.org Artificial Intelligence

We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.


Controlling dynamical systems to complex target states using machine learning: next-generation vs. classical reservoir computing

Haluszczynski, Alexander, Köglmayr, Daniel, Räth, Christoph

arXiv.org Artificial Intelligence

Controlling nonlinear dynamical systems using machine learning allows to not only drive systems into simple behavior like periodicity but also to more complex arbitrary dynamics. For this, it is crucial that a machine learning system can be trained to reproduce the target dynamics sufficiently well. On the example of forcing a chaotic parametrization of the Lorenz system into intermittent dynamics, we show first that classical reservoir computing excels at this task. In a next step, we compare those results based on different amounts of training data to an alternative setup, where next-generation reservoir computing is used instead. It turns out that while delivering comparable performance for usual amounts of training data, next-generation RC significantly outperforms in situations where only very limited data is available. This opens even further practical control applications in real world problems where data is restricted.